Abstract

Theobroma cacao is an economically important tree of several tropical
countries. Its genetic improvement is essential to provide protection
against major diseases and improve chocolate quality. We discovered
and mapped new expressed sequence tag-single nucleotide polymorphism
(EST-SNP) and simple sequence repeat (SSR) markers and constructed a
high-density genetic map. By screening 149 650 ESTs, 5246 SNPs were
detected in silico, of which 1536 corresponded to genes with a putative
function, while 851 had a clear polymorphic pattern across a collection
of genetic resources. In addition, 409 new SSR markers were detected on
the Criollo genome. Lastly, 681 new EST-SNPs and 163 new SSRs were
added to the pre-existing 418 co-dominant markers to construct a large
consensus genetic map. This high-density map and the set of new genetic
markers identified in this study are a milestone in cocoa genomics and
for marker-assisted breeding. The data are available at
http://tropgenedb.cirad.fr.
1. Introduction

Theobroma cacao L. is a diploid species (2n = 2x = 20)
with a small genome ranging in size from 411 to
494 Mb. 1 According to Cheesman, 2 its centre of
origin is at the lower eastern equatorial slopes of the
Andes.



These authors contributed equally to this work. 



Theobroma cacao is grown as a major cash crop that
provides income to 14 million small-scale farmers in
more than 50 tropical countries. However, cocoa production
is markedly affected by a number of major
diseases caused by several Phytophthora species, or
by Moniliophthora perniciosa and Moniliophthora
roreri. Several sources of disease resistance have
been identified and the search for sustainable
disease resistance by cumulating the different resistance
genes is one of the major challenges facing
T. cacao breeding programmes. 3 The quality of chocolate
is another important trait in cocoa breeding,
and consumer demand for high-quality chocolate is
increasing. A better understanding of the molecular
and genetic bases of these traits is a key goal of
cocoa genetic research.

High-density genetic maps are essential tools for trait
genetic studies. Several molecular marker types have
been developed in T. cacao in recent decades: restriction
fragment length polymorphism (RFLP), microsatellites
or simple sequence repeats (SSRs), random
amplified polymorphic DNA, amplified fragment
length polymorphism and isozymes. 4-6 Among them,
only RFLP, SSR and single-nucleotide polymorphism
(SNP) are co-dominant markers, and therefore more
powerful for genetic analyses. Compared with RFLP,
the advantage of SSR and SNP markers is that they
can be revealed using high-throughput technologies
with scant amounts of DNA. Semagn et al. 7 made a
detailed comparison of the characteristics of each
kind of marker. A high-density cocoa linkage map
enriched with SSR genomic markers, including only
co-dominant markers, was developed by Pugh et al. 8
More recently, that map was supplemented with 114
EST-SSRs. 9

In recent years, the use of SNP markers has substantially
increased in plant genetics, for example in
Arabidopsis, 10 grapevine, 11 wheat, 12 and also a few
woody perennial species. 13-15 SNPs are one of the
most abundant types of DNA sequence polymorphism
and SNP markers are suitable for large-scale
genome analysis using high-throughput automated
genotyping techniques. SNPs have been used to construct
high-resolution genetic maps 16,17 or to trace
evolution, particularly in the human genome, using
large-scale SNP datasets. 18,19 Knowledge of nucleotide
substitution dynamics is an important basis for
molecular evolutionary studies, phylogeny reconstruction
and natural selection studies. 20,21 Transitions are
generally observed with higher frequencies than
transversions. During natural selection, transitions
are better tolerated because they are more likely than
transversions to generate synonymous mutations in
protein-coding sequences. 22-25

Of existing SNP markers, EST-SNPs (i.e. SNPs located 
within a gene expressed sequence) are of particular 
interest for studying functional genetic diversity and 
identifying candidate genes as the functional base of 
quantitative trait loci (QTLs). EST-SNPs have been 
developed for numerous plant models such as 
melon, 26,27 Brassica rapa, 28 barley, 29 poplar, 14 and 
sugarcane 30 to detect QTLs for many traits and facili- 
tate the selection of resistant and productive plants. In 
T. cacao, a few SNPs were detected in ESTs from 
expression libraries representing T. cacao/M. perni- 
ciosa interactions. 31 



In our study, we discovered and mapped several
hundred EST-SNP markers detected in an exhaustive
collection of cocoa ESTs 32 homologous to genes
with a known function. These SNP markers were supplemented
by 163 new SSR markers to construct a
very high-density genetic map suitable for large-scale
genetic studies.

2. Materials and Methods 

2.1. Plant material 

SNP polymorphisms were screened in a collection of 
diverse germplasm representing the major part of the 
T. cacao diversity and two existing mapping popula- 
tions denominated UPA402 x UF676 and F2. 

The collection of diverse germplasm consisted of 
249 genotypes from various genetic groups and geo- 
graphical origins (Table 1). Most of these accessions 
are maintained at the International Cocoa 
Genebank (ICG) at the Cocoa Research Unit (CRU), 
University of the West Indies, Trinidad and Tobago. 

The UPA402 x UF676 mapping population consisted
of 264 individuals derived from a cross of two
unrelated heterozygous tree accessions: UPA402, an
Upper Amazon Forastero from Peru, and UF676, a
Trinitario (Forastero x Criollo hybrid) selected in
Costa Rica. This progeny was maintained by the Centre
National de Recherche Agronomique (CNRA) in
Bingerville and Divo, Côte d'Ivoire. It was used to
establish the genetic map that serves as a reference in
our laboratory. 4,6,8,9 We mapped the new SSR and SNP
markers in this population.

Table 1. Theobroma cacao genotypes of various geographical
origins used to screen the polymorphism of the 1536
GoldenGate SNP panel

Accession collection group name   Number of genotypes   Geographical origin
AMAZ                              2                     Ecuador
APA                               1                     Colombia
Nacional                          3                     Ecuador
Criollo                           14                    Mexico-Belize
EBC                               4                     Colombia
Trinitario                        28                    Trinidad
GU                                12                    French Guiana
IMC                               19                    Peru
LCTEEN                            46                    Ecuador
MORONA                            3                     Peru
NANAY                             49                    Peru
PARINARI                          40                    Peru
POUND                             6                     Peru
SC                                5                     Colombia
SCAVINA                           8                     Peru
Amelonado type                    3                     Brazil
SPEC                              6                     Colombia
Total                             249

The second progeny, an F2 of 132 individuals, was
obtained by selfing a hybrid between two heterozygous
parents: Scavina 6, an Upper Amazon Forastero
collected in Peru, and ICS1, a Trinitario selected in
Trinidad. This progeny was produced by Comissão
Executiva do Plano da Lavoura Cacaueira (CEPLAC)
at Itabuna, Brazil.

2.1.1. Genotypes used for EST-SNP detection and 
selection 

Most of the ESTs screened for SNPs had been 
obtained from the contrasting genotypes Scavina 6, 
an upper Amazon Forastero genotype from Peru and 
ICS1, a Trinitario selected in Trinidad, a hybrid 
between a Criollo from Central America and a 
Forastero from Lower Amazonia of Brazil. 

These two genotypes, which represent the three 
distinct genetic origins, Upper Amazon Forastero, 
Lower Amazon Forastero, and Criollo, were also the 
parents of the F2 population from Brazil used to 
map SNPs. 

Eleven other genotypes were involved in the construction
of the cDNA libraries and SNP identification:
B97-CC2, a Criollo from Belize; P7, IMC47 and UPA134,
Upper Amazon Forastero genotypes from Peru; Jaca,
an Upper Amazon Forastero from Brazil; GU255V, collected
in French Guiana; B240 and 33-49, two
Nacional genotypes from Ecuador; UF676 and UF273,
two Trinitario; and seedlings from a hybrid selected
in Papua New Guinea.

SSRs were screened in three T. cacao genotypes: the
two parents of the reference map (UPA402 and
UF676), and the sequenced Criollo genotype
(B97-61/B2).

2.2. DNA extraction and purification 

Genomic DNA was extracted according to a protocol
using MATAB buffer already described for the isolation
of genomic DNA. 6 DNA was resuspended with 1 ml of
TE (10 mM Tris-HCl and 1 mM EDTA, pH 8.0).

DNA was purified with the Nucleobond® PC 20 kit
(Macherey-Nagel, Cat. No. 740.571.100) with the
modification that steps 1 and 2 were omitted and
the DNA was purified directly after its isolation. A
1 ml mixture composed of 200 µl of crude DNA
(20 µg DNA maximum), 450 µl water and 350 µl
S3 buffer + RNAse (buffers provided with the kit)
was passed through the column (step 3). This solution
was homogenized on a rocking table for at least 1 h.
After precipitation of the eluate with an equal
volume of isopropyl alcohol, the pellet was resuspended
in 70 µl of TE.



The quality and quantity of DNA were first checked
on 0.8% agarose gel, compared with a standard range,
and then the Quant-iT™ PicoGreen® dsDNA Assay
from Invitrogen™ was used. A quality test was performed
for each sample by amplifying microsatellite
markers in a PCR mixture with a high DNA concentration
(100 ng DNA in a 10 µl reaction volume). The
purification step was repeated when the amplification
failed.

2.3. In silico SNP discovery and verification 

A collection of 149 650 ESTs (EMBL accession
number CU469588 to CU633156), corresponding
to 48 594 unigenes, was produced after sequencing
56 cDNA libraries constructed from material collected
from different organs, genotypes, and under different
environmental conditions. 32

SNPs were detected in silico and quality checked 
using the QualitySNP pipeline, 33 as reported in 
Argout et al. 32 and in ESTtik (http://esttik.cirad.fr). 

QualitySNP uses quality information related to each
EST and a haplotype-based strategy to predict reliable
SNPs. In order to detect SNPs in known homologous
coding sequences, we selected contigs displaying a significant
similarity with proteins from a non-redundant
protein sequence database (NR), with entries from
GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq,
as described in Argout et al. 32

2.4. Validation of SNPs via GoldenGate assay

A total of 30-50 ng of genomic DNA per plant was
used for Illumina SNP genotyping with the Illumina
BeadArray platform at the French National
Genotyping Centre (CNG, CEA-IG, Evry, France),
according to the GoldenGate Assay manufacturer's
protocol. Three 3-day assays were carried out to genotype
the progeny samples for the 1536 SNP set
revealed by the GoldenGate assay (Supplementary
Table S1). The protocol was similar to that briefly
described by Hyten et al. 34 except for the number of
oligonucleotides involved in a single DNA reaction,
thus comprising 4608 custom oligos assembled in
the oligo pooled assays (OPA) designed by Illumina
Inc. Raw hybridization intensity data processing,
clustering, and genotype calling were performed using
the genotyping module in the BeadStudio/GenomeStudio
package (Illumina, San Diego, CA, USA). Illumina has
developed a self-normalization algorithm that relies on
information contained in each array, as described by
Akhunov et al. 35

The clustering and genotype calling of each of the
1536 SNP markers were checked for their conformity
and correct genotype distribution using known
homozygous and heterozygous genotypes, included
in the collection of diverse genotypes, as standards.

2.5. SSR in silico discovery and genotyping

The MIcroSAtellite identification tool (MISA 
http://pgrc.ipk-gatersleben.de/misa) was used to 
perform SSR searches, and primers were designed 
with Primer3 software. 36 
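As a rough illustration of what a perfect-SSR search involves, the sketch below scans a sequence for tandem repeats with a regular expression. It is not MISA itself: the function name, the test sequence, and the minimum repeat counts per motif length are assumptions made for this example, since the text does not state the settings that were used.

```python
import re

# Minimum repeat counts per motif length. These thresholds are an assumption
# for illustration; the text does not state the settings that were used.
MIN_REPEATS = {2: 6, 3: 5, 4: 5}


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) for each perfect SSR found."""
    sequence = sequence.upper()
    hits = []
    for motif_len, min_count in sorted(min_repeats.items()):
        # One motif of motif_len bases followed by at least min_count - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_count - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "ATAT"), which belong to a shorter motif class.
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)


# Hypothetical sequence containing an (AG)8 and a (CAT)5 repeat.
example = "GGCT" + "AG" * 8 + "TTC" + "CAT" * 5 + "GA"
ssrs = find_ssrs(example)
```

Each hit gives the 0-based start position, the repeat motif and the number of copies; flanking primers would then be designed around those coordinates.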

Primers flanking microsatellite loci were designed at 
each end of the scaffolds to orient and anchor them 
to the genetic map. 

SSRs identified in the scaffolds were mapped only 
on the reference map established from the 
UPA402 x UF676 cross. 

A total of 409 primer pairs (Supplementary Table
S2) were defined in the 100 largest non-anchored
scaffolds using Primer3 software 36 and screened for
their ability to segregate in the UPA402 x UF676
progeny.

For a given SSR locus, the forward primer was
designed with a 5'-end M13 tail
(5'-CACGACGTTGTAAAACGAC-3'). PCR amplifications were performed
in a Mastercycler ep384 thermocycler
(Eppendorf, Germany) with 5 ng of purified DNA in a
10 µl final volume of buffer containing 10 mM Tris-HCl
(pH 8), 50 mM KCl, 0.001% glycerol, 2.0 mM
MgCl2, 0.08 µM of the M13-tailed forward primer,
0.1 µM of the reverse primer, 200 µM of dNTP, 1 U of
Taq DNA polymerase (Life Technologies, USA), 0.1 µM
of M13 primer-fluorescent dye 6-FAM™, NED®, VIC®,
or PET® (Applied Biosystems, CA, USA). The DNA and
buffer were distributed in 384-well plates using a Biomek
NX automatic pipetting robot (Beckman Coulter, CA,
USA). The touchdown PCR programme used was as
follows: initial denaturation at 95°C for 5 min, followed
by 10 cycles at 95°C for 45 s, Tm of 56-46°C (-1°C/cycle)
for 1 min, and 72°C for 1 min 30 s. After these
cycles, an additional round of 25 cycles was performed
at 95°C for 45 s, Tm of 50°C for 1 min, and
72°C for 1 min, with a final elongation step at 72°C
for 30 min.
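The M13-tailing step amounts to prepending the tail quoted above to each locus-specific forward primer. A minimal sketch follows; the function name and the example primer are hypothetical and only serve to show the operation.

```python
# M13 tail sequence quoted in the text (5' -> 3').
M13_TAIL = "CACGACGTTGTAAAACGAC"


def add_m13_tail(forward_primer, tail=M13_TAIL):
    """Return the forward primer with the M13 tail attached at its 5' end."""
    primer = forward_primer.upper().replace(" ", "")
    if not primer or set(primer) - set("ACGT"):
        raise ValueError("primer must be a non-empty A/C/G/T sequence")
    return tail + primer


# Hypothetical locus-specific forward primer, for illustration only.
tailed = add_m13_tail("GCAGGCAGGTAGAGTTTCA")
```

The fluorescent M13 primer then anneals to the tail, so a single labelled oligonucleotide per dye can serve every SSR locus.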

PCR products were diluted specifically for each dye
and pooled for multiplex SSR genotyping (revealing
two SSRs having different sizes of amplified product
per dye). A mixture of 1 B µl of Hi-Di™ formamide
(Applied Biosystems) and 0.12 µl of size marker
GeneScan™ 600-LIZ-Size® Standard V2.0 (Applied
Biosystems) was added to 2 µl of the diluted PCR
pool. This pool was then analysed using the ABI
3500xL automatic sequencer (Applied Biosystems).

Images were analysed using Genemapper 4.0 soft- 
ware (Applied Biosystems) and exported as a data 
table. 

2.6. Genetic mapping 

The UPA402 x UF676 population was the result of
a cross between two heterozygous cocoa clones,
UPA402 (♀), an Upper Amazon Forastero, and
UF676 (♂), a Trinitario. In this case, there were
three segregation possibilities: loci that were homozygous
for one parent and heterozygous for the other
(1:1 segregation), and loci that segregated in both
parents (1:2:1 or 1:1:1:1).
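The three segregation types can be derived by enumerating the gametes of each parent at a co-dominant locus. The sketch below does this for one locus; the allele labels a, b, c and d are hypothetical placeholders, not marker alleles from the study.

```python
from collections import Counter
from functools import reduce
from itertools import product
from math import gcd


def expected_segregation(parent1, parent2):
    """Expected offspring genotype classes and their ratio at one locus.

    Each parent is given as a two-character string of allele labels.
    """
    # Pair every gamete of parent 1 with every gamete of parent 2.
    offspring = Counter("".join(sorted(pair)) for pair in product(parent1, parent2))
    # Reduce the class counts to the simplest whole-number ratio.
    divisor = reduce(gcd, offspring.values())
    return {genotype: n // divisor for genotype, n in sorted(offspring.items())}


one_parent = expected_segregation("ab", "aa")   # heterozygous in one parent only
two_alleles = expected_segregation("ab", "ab")  # both parents, same two alleles
four_alleles = expected_segregation("ab", "cd")  # both parents, four alleles
```

The first case yields two classes in a 1:1 ratio, the second three classes in a 1:2:1 ratio, and the third four classes in a 1:1:1:1 ratio.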

Segregations were checked for goodness-of-fit to
the expected Mendelian ratio using a chi-square test
at significance levels of 0.05 and 0.01.
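A goodness-of-fit check of this kind can be sketched as follows, using standard chi-square critical values for 1 to 3 degrees of freedom. The function names and the genotype counts are hypothetical; the mapping software performs its own version of this test, which is not reproduced here.

```python
# Upper-tail chi-square critical values, indexed by degrees of freedom.
CRITICAL = {
    1: {0.05: 3.841, 0.01: 6.635},
    2: {0.05: 5.991, 0.01: 9.210},
    3: {0.05: 7.815, 0.01: 11.345},
}


def chi_square_fit(observed, ratio):
    """Chi-square statistic of observed class counts against a Mendelian ratio."""
    if len(observed) != len(ratio):
        raise ValueError("observed counts and ratio must have the same length")
    total = sum(observed)
    expected = [total * part / sum(ratio) for part in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def skew_flag(observed, ratio):
    """Return '' (fits), '*' (P < 0.05) or '**' (P < 0.01) for one marker."""
    stat = chi_square_fit(observed, ratio)
    limits = CRITICAL[len(ratio) - 1]  # degrees of freedom = classes - 1
    if stat > limits[0.01]:
        return "**"
    if stat > limits[0.05]:
        return "*"
    return ""


# Hypothetical genotype counts, for illustration only.
flag_1_1 = skew_flag([130, 134], (1, 1))        # 1:1 locus, 264 individuals
flag_1_2_1 = skew_flag([20, 70, 42], (1, 2, 1))  # 1:2:1 locus, 132 individuals
```

Markers flagged with one or two asterisks would be reported as showing skewed segregation at the corresponding significance level.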

Individual and consensus maps were constructed
using JoinMap software, version 4.0. 37

JoinMap is able to combine data of several segregation
types to construct a consensus genetic map. Here
we used population type CP for the UPA402 x UF676
map, and population type F2 for the F2 population
from Brazil. A lod score of 6 was used to identify 10
linkage groups (LGs) independently for each map. A
consensus genetic map was established from the
two distinct genetic maps. The corresponding groups
were associated in pairs with JoinMap software. The
Kosambi mapping function, with a lod score of 5
and a jump threshold of 3, was used to convert
recombination frequencies into map distances. 38
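For reference, the Kosambi function relates a recombination fraction r to a map distance d (in cM) by d = 25 ln[(1 + 2r)/(1 - 2r)]. The sketch below implements this textbook formula and its inverse; it illustrates the conversion only and says nothing about how the mapping software orders markers. The function names are chosen for this example.

```python
import math


def kosambi_cm(r):
    """Map distance in cM for a recombination fraction r (0 <= r < 0.5)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_r(d_cm):
    """Inverse conversion: recombination fraction for a distance in cM."""
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)


distance = kosambi_cm(0.10)  # a 10% recombination fraction, slightly over 10 cM
```

For small r the distance in cM is close to 100 r, and the correction grows as r approaches 0.5.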

This consensus map combined the new EST-SNPs 
and genomic SSRs defined from the scaffolds, in add- 
ition to the previously mapped markers. 9 This map 
contained only markers with a known nucleotide 
sequence. 

3. Results 

3.1. Identification of SNPs and development of the
GoldenGate assay

The assembly made from the 149 650 T. cacao EST
sequences (see Materials and Methods) generated
12 692 T. cacao contigs. The number of ESTs per
contig ranged from 2 to 5102. To detect good
quality in silico SNPs, we assumed that contigs with
more than 100 members contained paralogous
sequences. 13,39 We therefore first selected 4818
contigs that contained at least 4 but no more than
100 EST members. A total of 5246 SNPs were identified
in silico in 2012 contigs.
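The contig depth filter described above can be expressed as a simple range test on the number of ESTs per contig. In this sketch the contig names and counts are hypothetical; only the bounds (at least 4, no more than 100) come from the text.

```python
def select_contigs(est_counts, min_ests=4, max_ests=100):
    """Keep contigs whose EST membership lies in [min_ests, max_ests]."""
    return {name: n for name, n in est_counts.items()
            if min_ests <= n <= max_ests}


# Hypothetical contig names and EST counts, for illustration only.
contigs = {"contig_a": 2, "contig_b": 4, "contig_c": 37,
           "contig_d": 100, "contig_e": 5102}
kept = select_contigs(contigs)
```

Shallow contigs are dropped because too few reads support a SNP call, and very deep contigs are dropped as likely mixtures of paralogous sequences.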

We selected 4150 in silico SNPs detected in 1834
contigs that had a significant BlastX annotation similarity
with known proteins of the NCBI non-redundant
protein sequence database (NR) with entries
from GenPept, Swissprot, PIR, PRF, PDB, and NCBI
RefSeq, as described in Argout et al. 32

3.2. SNP performance and quality 

The set of 4150 in silico SNPs was selected in the
EST contigs and the SNP-harbouring sequences were
then submitted to Illumina for processing using the
Illumina® Assay Design Tool (ADT). ADT generates
scores for each SNP that can range from 0 to 1;
SNPs with scores >0.6 have a high probability of
being converted into a successful genotyping assay.
In the set of 4150 submitted SNPs, 83.5% showed a
high conversion success rate (>0.6), 9.2% showed a
moderate conversion success rate (between 0.4 and
0.6), and 7.3% showed either a low conversion
success rate or no score. A total of 1536 SNP sites
having ADT scores >0.4 and without any other SNPs
within the adjacent 60 bp were selected for the OPA
design (Supplementary Table S1).
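The two selection criteria (ADT score above 0.4 and no other SNP within 60 bp) can be sketched as below. The data layout, function name and example records are hypothetical, and two details are assumptions: a neighbouring SNP counts regardless of its own score, and "within 60 bp" is read as a distance of 60 bases or fewer on the same contig.

```python
def select_for_opa(snps, min_score=0.4, flank=60):
    """Filter SNPs given as (contig, position, adt_score) tuples."""
    # Index every SNP position by contig so neighbours can be checked.
    by_contig = {}
    for contig, pos, _ in snps:
        by_contig.setdefault(contig, []).append(pos)
    kept = []
    for contig, pos, score in snps:
        if score is None or score <= min_score:
            continue  # unscored or low-scoring SNPs are not designable
        crowded = any(other != pos and abs(other - pos) <= flank
                      for other in by_contig[contig])
        if not crowded:
            kept.append((contig, pos, score))
    return kept


# Hypothetical SNP records, for illustration only.
candidates = [("c1", 100, 0.9), ("c1", 140, 0.8), ("c1", 400, 0.5),
              ("c2", 50, 0.3), ("c2", 300, None), ("c3", 75, 0.41)]
selected = select_for_opa(candidates)
```

In this example the two SNPs 40 bp apart on contig c1 exclude each other, and the low-scoring and unscored SNPs on c2 are dropped.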

3.3. Analysis of base changes 

One thousand and forty-four in silico SNPs (68%)
were transitions and 492 (32%) were transversions
(Table 2). This transition/transversion ratio
tallies with the results observed in other plant
species, where transition SNPs are generally more
frequent than transversion SNPs.
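The classification behind these counts follows from base chemistry: a substitution between two purines (A, G) or two pyrimidines (C, T) is a transition, and any purine/pyrimidine exchange is a transversion. The sketch below encodes that rule and computes the ratio for the panel, taking the transversion total as 1536 - 1044 = 492, which matches the sum of the four transversion rows of Table 2 (126 + 128 + 112 + 126). The function name is chosen for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(base1, base2):
    """Classify a biallelic SNP as 'transition' or 'transversion'."""
    pair = {base1.upper(), base2.upper()}
    if len(pair) != 2 or not pair <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases from A, C, G, T")
    same_family = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_family else "transversion"


# Class totals for the 1536 selected SNPs.
panel_size = 1536
transitions = 1044
transversions = panel_size - transitions
ts_tv_ratio = transitions / transversions
```

With these totals the ratio is a little above 2, i.e. roughly two transitions for every transversion.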

3.4. SNP polymorphism 

From the 1536 SNPs, 841 (55%) with a
non-ambiguous polymorphic pattern across accessions
were retained as true and verified SNPs and
denominated TcSNP. Of the rest, 113 (7%) failed
to be genotyped, 436 (28%) had a monomorphic
pattern, and 146 (10%) were polymorphic but did
not show any clear fluorescent pattern suitable for
reliable genotype classification.
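The four outcome classes partition the whole panel, which can be checked with a few lines of arithmetic. The counts are those given in the text; the dictionary labels are wording chosen for this sketch.

```python
# Genotyping outcome counts for the 1536-SNP panel, as reported in the text.
outcomes = {
    "polymorphic, clear pattern": 841,
    "failed": 113,
    "monomorphic": 436,
    "polymorphic, unclear pattern": 146,
}

total = sum(outcomes.values())
shares = {name: round(100 * n / total) for name, n in outcomes.items()}
```

The classes sum to the panel size, and the rounded shares reproduce the percentages quoted above.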

Of the 841 polymorphic SNPs, 461 segregated in 
the mapping population (UPA402 x UF676) and 
could be mapped on the reference map. Five 
hundred and thirty-one were polymorphic and 
mapped on the F2 population map. Two hundred 
and thirty-nine SNP markers were segregating in 
both maps, thus enabling construction of a consensus 
map between them. 

3.5. SSR polymorphism 

A high-density genetic map is a key tool to order
the scaffold assembly needed to generate a complete
cocoa genome sequence. SSR markers were defined in
the largest non-anchored scaffolds in order to
improve anchoring of the T. cacao genome assembly
provided by the International Cocoa Genome
Sequencing consortium 1 on the genetic map.

Table 2. Nucleotide substitution types of the 1536 selected in
silico SNPs

Types      Number of SNPs   Percentage   Percentage by class
A <-> C    126              8            Transversion 32
A <-> T    128              8
C <-> G    112              7
T <-> G    126              8
T <-> C    612              40           Transition 68
A <-> G    432              29

From the 409 screened SSRs (Supplementary Table
S2), 163 were polymorphic for the UPA402 x UF676
progeny and could be mapped.

The new SSR markers defined from scaffolds were
named mTcCIR450 to mTcCIR613 to extend the previously
identified SSR marker series: mTcCIR1 to
mTcCIR291 8 from genomic DNA and mTcCIR292
to mTcCIR447 9 from ESTs.



3.6. Individual genetic linkage maps 

3.6.1. Map of the UPA402 x UF676 population 

A new set of 624 markers with their corresponding
sequences, including 461 EST-SNP and 163 new SSR
markers located on scaffolds of the genome assembly, 1
were added to the reference map (Fig. 1,
Supplementary Table S3).

The new UPA402 x UF676 map contained 1043 
markers, including 461 EST-SNPs, 524 SSRs and 58 
RFLPs (Table 3). Of the 1043 markers, 571 corre- 
sponded to gene markers. The length of this map 
was 751.7 cM having an average distance of 0.7 cM 
between adjacent markers. 

Skewed segregation was observed for 118 markers
(11.3%). The skewed markers were mainly located
in LGs 2, 3, 6 and 10, as shown in Fig. 3.

3.6.2. Map of the F2 population 

The F2 map (Fig. 2, Supplementary Table S4) con- 
tained 531 EST-SNP markers. This map had a total 
length of 753.9 cM, with an average distance of 
1.4 cM between neighboring markers. The marker 
density varied, with an average distance between 
neighboring markers ranging from 0.9 cM in LG 9 to 
2.7 cM in LG7 (Table 4). 

Skewed segregation was observed in 97 markers
(18.3%). The skewed markers were mainly located
in LGs 1, 3 and 4, as shown in Fig. 3.

3.7. Consensus genetic linkage map 

Two hundred and thirty-nine SNP markers were 
mapped in both populations. 

The complete consensus map (Table 5) contained
1262 codominant markers including 681 EST-SNPs,
523 SSRs (163 scaffold-tagged SSRs, 110 EST-SSRs,
250 SSRs from genomic DNA) and 58 RFLPs including
14 resistance gene analogues (Rgenes-RFLPs),
arranged in 10 LGs corresponding to the
haploid chromosome number of T. cacao (Fig. 3,
Supplementary Table S5). Among the 1262 markers,
65% were gene-based markers, including SNPs, SSRs
and RFLPs.
Figure 1. Genetic map constructed from an F1 progeny of 264 individuals (located in CNRA, Côte d'Ivoire) belonging to the UPA402 x
UF676 cross. This map consists of 1043 markers of a known DNA sequence (461 SNPs, 524 SSRs, and 58 RFLPs), spanning 752 cM.
The average distance between two markers is 0.7 cM. The new markers added to this map are printed in red.
Table 3. Distribution of each marker type in the LGs of the UPA402 x UF676 genetic map

LG      Length   Total number   Average distance       SNP   RFLP   Genomic   SSR from   EST-SSR
        (cM)     of markers     between markers (cM)                SSR       scaffold
LG1     90.3     150            0.6                    69    8      33        22         18
LG2     97.5     126            0.8                    55    6      26        28         11
LG3     74.4     126            0.6                    57    6      33        18         14
LG4     75.6     120            0.6                    61    11     27        14         8
LG5     77.4     121            0.6                    55    6      34        15         11
LG6     62.7     73             0.9                    28    2      19        12         12
LG7     48.9     51             1.0                    11    6      16        16         2
LG8     60.3     64             0.9                    30    1      17        5          11
LG9     103.2    154            0.7                    78    6      34        17         20
LG10    61.3     54             1.1                    17    6      12        16         3
Total   751.7    1043           0.7                    461   58     251       163        110

SNP, single-nucleotide polymorphism; RFLP, restriction fragment length polymorphism; SSR, simple sequence repeat; EST, expressed
sequence tag.



The total map length was 733.6 cM, i.e. slightly
shorter than previously constructed maps (782.8 cM
for Pugh et al. 8 and 779.2 cM for Fouet et al. 9 ). The
average distance between adjacent markers on this
map was 0.6 cM, and thus shorter than the 1.3 cM
of the map of Fouet et al. 9
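The average marker spacings quoted for the three maps are consistent with dividing total map length by the number of markers and rounding to one decimal place. The sketch below reproduces them from the lengths and marker counts given in the text; that the reported values were obtained exactly this way is an assumption.

```python
def average_spacing(length_cm, n_markers):
    """Average distance between markers, taken as length / marker count."""
    return round(length_cm / n_markers, 1)


# (total length in cM, number of markers) for each map, from the text.
maps = {
    "UPA402 x UF676": (751.7, 1043),
    "F2": (753.9, 531),
    "consensus": (733.6, 1262),
}
spacing = {name: average_spacing(*values) for name, values in maps.items()}
```

Merging the two populations adds markers without lengthening the map, which is why the consensus spacing is the smallest of the three.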

The number of mapped loci varied substantially
between LGs on the consensus map, from 63 in
LG10 to 201 in LG1. The average distance between
two markers in the different LGs ranged from
0.4 cM in LG1 to 0.9 cM in LG10.

In total, 844 new markers (681 SNP markers and
163 SSRs defined in scaffolds) were mapped. These
new markers were well distributed over all chromosomes,
making it possible to fill some gaps in the previous
maps, for example on chromosome 10.



4. Discussion 

A large set of EST-SNP markers was generated and 
mapped in T. cacao. New SSR markers were added to 
these SNPs, providing an efficient tool for high- 
throughput genotyping of cocoa populations. 

SSR markers are multiallelic and well adapted for
fine analysis of population diversity structure. 40-43
In T. cacao, an average of 5.8 alleles per SSR
was observed by Loor Solorzano 44 after genotyping
a collection of genetic resources of various genetic
origins, with a maximum of 15 alleles revealed
by one SSR (mTcCIR322). SNPs, by contrast, are
only biallelic, but a much higher number of
SNP markers (several thousand) can easily be
revealed at once using high-throughput technologies.

We used our new SNP and SSR markers to construct
a very high-density genetic map. Sixty-five per cent of
the markers were from within genes and the average
distance between adjacent markers was 0.6 cM.

Several chromosome regions include markers with
skewed segregations, particularly on LG 1, LG 3, LG
4, and LG 6. The region on LG 4 includes the locus
for self-incompatibility previously identified by 
Crouzillat et al. 5 The gameto-sporophytic incompati- 
bility system existing in T. cacao 45,46 could possibly 
explain the segregation distortion on this LG 4 
region. Other factors which could explain segregation 
distortion, such as chromosome rearrangements in 
banana 47 which are responsible for highly skewed 
marker segregations, have not been reported in 
T. cacao. 

This high-density genetic map can be used as a 
major tool for efficient genome-wide association 
studies (GWASs) in T. cacao populations. This 
method, first applied in human and animal genet- 
ics, 48-51 was also found to be highly effective for 
studying the determinism of useful traits in 
plants, 52-56 particularly in cocoa with the analysis 
of some recent hybrid populations. 8,57,58 GWAS is an 
alternative to QTL analyses in cross progenies for the 
purpose of studying genetic control of phenotypic 
traits in cocoa. 

GWASs can be carried out on unrelated genetic
resources such as wild or cultivated populations or
germplasm collections. Large cocoa germplasm collections
are maintained in many countries and characterized
for useful traits. Two international cocoa
collections are hosted at the International Cocoa
Genebank, Trinidad (ICGT, preserving 2300 accessions), 59
and at the Centro Agronomico Tropical de
Investigacion y Ensenanza, Turrialba, Costa Rica
(CATIE, preserving 1150 accessions). 60 The markers
identified here will now certainly facilitate such
GWASs, providing added value to this wide characterization
work, thus boosting knowledge on the genetic
determinants of useful cocoa traits.

Figure 2. Genetic map constructed from an F2 progeny of 132 individuals (located at CEPLAC, Brazil) obtained by selfing of a single
(Scavina 6 x ICS1) selected hybrid. This map consists of 531 SNP markers, spanning 754 cM. The average distance between two
markers is 1.4 cM.

Figure 3. Consensus map of (UPA402 x UF676) and F2 progenies. Markers segregating in both progenies are indicated in black, those
segregating only in (UPA402 x UF676) are printed in green (previously mapped markers) and blue (newly mapped markers).
Markers segregating only in the F2 progeny are printed in pink. This consensus map consists of 1262 markers of a known DNA
sequence, and it has a length of 734 cM. The average distance between two markers is 0.6 cM. Among the 1262 markers, 810
correspond to markers defined in expressed genes. Significant skewed segregations are indicated by asterisks (*P < 0.05, **P < 0.01)
or dots (F2 population).

Another benefit of this large set of mapped markers 
is the possible integration of molecular information in 
conventional cocoa breeding schemes using marker- 
assisted selection (MAS). 

In cocoa, few MAS experiments are currently underway. 3,61
The efficiency of MAS in selecting
P. palmivora-resistant cocoa plants has been reported
by Lanaud et al. 3

Until now, MAS studies have mainly been focused 
on traits controlled by a small number of genes, 
using only markers close to QTLs. However, such 
methods are of limited use for traits that are 
determined by a large number of genes of small 
effects. 



Table 4. Distribution of SNP markers in the LGs of the F2 map
(selfing of SCA6 x ICS1)

LG      Length (cM)   Number of SNP markers   Average distance between markers (cM)
LG1     98.6          104                     0.9
LG2     102.8         74                      1.4
LG3     78.7          73                      1.1
LG4     69.1          49                      1.4
LG5     85.4          56                      1.5
LG6     71.0          28                      2.5
LG7     48.7          19                      2.6
LG8     43.3          28                      1.5
LG9     104.9         78                      1.3
LG10    51.5          22                      2.3
Total   753.9         531                     1.4

SNP, single-nucleotide polymorphism.



Substantial genome-wide molecular data can now
be generated at lower cost by high-throughput technologies,
such as SNP genotyping. This progress has
paved the way for the development of new methods
to predict genotype value via MAS. The genome-wide
selection or genomic selection (GS) method
was recently applied successfully in animal and plant
breeding 62-66 and allows phenotypes to be predicted
using all marker information.

The integration of molecular markers in cocoa 
recurrent breeding programmes 67-72 could be 
facilitated by the GS approach in order to accelerate 
genetic gains. The GS strategy seems particularly suitable 
for the selection of multigenic traits such as yield and 
disease resistance. Cumulating a large number of resist- 
ance alleles is one of the main objectives of cocoa breed- 
ing for sustainable cocoa resistance. The large set of 
available SNP markers could facilitate the selection of re- 
sistant and high yielding cocoa trees via GS approaches 
enabling the use of all genome regions tagged by SNP 
markers, even those with very small effects. 

The search for candidate genes underlying trait 
variation is another major challenge for plant 
biologists, with the aim of gaining further insight into 
the mechanisms underlying trait variation, and produ- 
cing tools to efficiently screen and exploit genetic 
resources. 

The consensus map produced in this work has been
used efficiently for anchoring an assembly of T. cacao
Criollo genome sequences, and for constituting
pseudomolecules. 1 Recently, two different cocoa varieties,
i.e. Criollo 1 and Forastero from the Lower
Amazon region (http://www.cacaogenomedb.org/),
were sequenced, with 28 798 and 35 000 annotated
genes, respectively. These sequences will greatly
facilitate the identification of candidate genes, allowing
integration of both genetic and genomic (functional
and structural) data. Overall, about 300 QTLs or
marker/trait associations have already been identified
in T. cacao. High-throughput genotyping associated
with a high marker density will facilitate fine mapping
of genes involved in trait variation (with GWAS or
classical QTL analyses conducted on large progenies),
thus making it possible to refine QTL positions in the
genome, while facilitating the search for candidate
genes in corresponding genome sequences. Several
functional studies have already been conducted in
cocoa, focused mainly on genes generally expressed in
specific physiological conditions or metabolisms. 73 It
will now be possible to focus more specifically on the
expression of genes directly responsible for trait
variation after candidate gene validation.

Table 5. Distribution of each marker type in the LGs of the consensus genetic map

LG      Length   Total number   Average distance       SNP   RFLP   Genomic   SSR from   EST-SSR
        (cM)     of markers     between markers (cM)                SSR       scaffold
LG1     77.1     201            0.4                    120   8      33        22         18
LG2     101.1    156            0.6                    85    6      26        28         11
LG3     76.9     162            0.5                    91    6      33        18         14
LG4     64.2     135            0.5                    75    11     27        14         8
LG5     78.1     147            0.5                    81    6      34        15         11
LG6     64       81             0.8                    36    2      19        12         12
LG7     52.6     62             0.8                    22    6      16        16         2
LG8     59.2     73             0.8                    39    1      17        5          11
LG9     100.9    182            0.6                    106   6      33        17         20
LG10    59.5     63             0.9                    26    6      12        16         3
Total   733.6    1262           0.6                    681   58     250       163        110

SNP, single-nucleotide polymorphism; RFLP, restriction fragment length polymorphism; SSR, simple sequence repeat; EST, expressed
sequence tag.

Analysing genome evolution during domestication 
processes or adaptation to climate change can also 
help us to identify key genes underlying adaptive 
traits. 74 Loss of diversity generally occurs during 
genome evolution, and some genes are selectively 
involved in natural selection or domestication. A large 
set of SNPs defined in expressed genes, such as those 
identified in this study, provides a key tool for identifying 
selection signatures or adaptive substitutions, and then 
highlighting candidate genes potentially involved in 
the adaptation 60 or domestication processes and their 
corresponding molecular functions. 75 All SNPs reported 
in this paper were identified in orthologous genes or 
gene families, thus facilitating comparative genomic 
approaches, and benefiting from gene knowledge accu- 
mulated in other species to accelerate cocoa breeding. 

5. Availability 

Information on the consensus linkage map,
molecular markers, and primers is available in the
Map Study 'SSR_SNP_consensus_map' of the cocoa
module of the TropGeneDB database
(http://tropgenedb.cirad.fr).